Fragrantica is a website that catalogs perfumes and crowdsources trait assignments from its visitors. Users can vote on notes, accords, season, and preferred time of wear (day/night), and can also rate and review each perfume. Combined with information provided by the perfume's creators, this produces a comprehensive description of each fragrance for the general public. This project investigates relationships between traits of each perfume using the Fragrantica database.
A "note" refers to a single ingredient, while an "accord" is a blend of multiple notes that creates a distinct scent profile and builds the fragrance's character. Notes are categorized into top, middle, and base notes, which describe how the scent develops over time. For example, a fragrance might have a floral accord built from notes of rose or lily.
We aimed to explore the relationship between different variables and whether a perfume was designed for men, women, or both. Using the most promising variables, we hoped to build a model to predict which gender each perfume is designated for.
library(ggplot2)
library(readxl)
library(knitr)
library(dplyr)
library(finalfit)
library(tidyr)
library(janitor)
library(fastDummies)
library(tree)
library(maptree)
library(randomForest)
library(plotly)
library(factoextra)
library(gbm)
library(dendextend)
library(ROCR)
fragrantica <- read.csv("fra_cleaned.csv", header = TRUE, sep = ";", dec = ",")
This data was taken from the Kaggle dataset "Fragrantica.com Fragrance Dataset", which includes information on 24,063 perfumes across 18 variables.
The variables in our dataset were:
url: the link to the perfume on the Fragrantica website.
Perfume: the name of the perfume.
Brand: the brand of the perfume.
Country: the country of origin of the perfume.
Gender: the intended gender audience of the perfume.
Rating.Value: the rating assigned to the perfume by website users.
Rating.Count: the number of ratings received from website users.
Year: the release year of the perfume.
Top: the Top notes of the perfume, as a comma-separated list. The number of Top notes per perfume ranges from 1-27.
Middle: the Middle notes of the perfume, as a comma-separated list. The number of Middle notes per perfume ranges from 1-25.
Base: the Base notes of the perfume, as a comma-separated list. The number of Base notes per perfume ranges from 1-22.
Perfumer1: the name of the main perfumer who created the perfume.
Perfumer2: the name of the secondary perfumer who created the perfume. Many perfumes do not have a secondary perfumer.
mainaccord1 through mainaccord5: the main accords for each perfume, with the most represented accord being mainaccord1. Not all perfumes have all 5 main accords. Main accords are based on the notes in the perfume.
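Since the Top, Middle, and Base fields are stored as comma-separated strings, a single field can be split into individual notes with base R. A minimal sketch, using a made-up example value rather than an actual row from the dataset:

```r
# Hypothetical Top field for one perfume (not from the real data)
top <- "Bergamot, Lemon, Pink Pepper"

# Split the comma-separated string into a character vector of notes
notes <- strsplit(top, ", ")[[1]]
notes
# → "Bergamot" "Lemon" "Pink Pepper"
length(notes)  # number of Top notes for this perfume
# → 3
```

The `separate_wider_delim()` calls below apply the same idea to every row at once, creating one column per note.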
Our first step was separating the notes out of each comma-separated list into individual variables. After doing this, we realized it was impractical for our project: some perfumes have up to 74 separate notes, and thus 74 separate variables. This also created a lot of missing values, since most perfumes have far fewer notes.
# separate notes
fragrantica_clean <- fragrantica %>% separate_wider_delim(Top, delim = ", ", names_sep = "", too_few = "align_start")
fragrantica_clean <- fragrantica_clean %>% separate_wider_delim(Middle, delim = ", ", names_sep = "", too_few = "align_start")
fragrantica_clean <- fragrantica_clean %>% separate_wider_delim(Base, delim = ", ", names_sep = "", too_few = "align_start")
# missing value plot
fragrantica_cleanval <- fragrantica_clean %>%
mutate(across(where(is.character), ~ na_if(.x, "")))
fragrantica_cleanval %>%
missing_plot()
In the plot above, light blue indicates missing values. As you can see, there were great discrepancies in the number of notes per perfume.
We decided to look at the main accords instead, since they were based on the notes in a perfume. Some accords were far more frequently represented than others, so to reduce model complexity and overfitting, we found the top 20 main accords and kept only the perfumes whose main accords all fall within that top 20.
# find top 20 accords
accord_list <- fragrantica_clean %>% select(c(mainaccord1, mainaccord2, mainaccord3, mainaccord4, mainaccord5)) %>% pivot_longer(cols = everything(), values_to = "accords") %>% drop_na() %>% filter(accords != "") %>% group_by(accords) %>% summarise(count = n()) %>% top_n(n = 20, wt = count)
ggplot(accord_list, aes(x = reorder(accords, count), y = count)) + geom_bar(stat = "identity", fill = 'thistle') +coord_flip() + theme_minimal() + labs(x="Accords", y="Count", title="Frequency of top 20 accords")
# filter by top 20 accords
fragrantica_filter <- fragrantica_clean %>%
filter(if_all(mainaccord1:mainaccord5, ~ . %in% accord_list$accords | is.na(.))) %>% remove_empty("cols") %>% select(-c(Top1:Base20)) %>% dummy_cols(select_columns = c("mainaccord1", "mainaccord2", "mainaccord3", "mainaccord4", "mainaccord5"), remove_first_dummy = TRUE, remove_selected_columns = TRUE)
# new missing value plot
fragrantica_cleanval2 <- fragrantica_filter %>%
mutate(across(where(is.character), ~ na_if(.x, "")))
fragrantica_cleanval2 %>%
missing_plot()
The main accords have been separated into dummy variables, and the missing-data heatmap looks much more manageable than before. The remaining missing values correspond to Year and Perfumer2, as some perfumes lack this data.
We visualized the categorical variables as bar charts and the quantitative variables as histograms to understand the frequencies and distributions of each variable in our dataset.
# Distribution of Gender
fragrantica_filter %>%
count(Gender, sort = TRUE) %>%
ggplot(aes(x = Gender, y = n)) +
geom_bar(stat = 'identity', fill = 'thistle') +
coord_flip() +
labs(
x = 'Gender',
y = 'Count',
title = 'Frequency of Gender'
) +
theme_minimal()
A majority of perfumes were targeted toward women, followed by unisex perfumes.
# distribution of rating scores
ggplot(fragrantica_filter, aes(x=Rating.Value)) + geom_histogram(binwidth = .1, color = 'black', fill = 'thistle') +
theme_minimal()
Scores for the perfumes were approximately normally distributed around 3.9/5.
# distribution of year
ggplot(fragrantica_filter, aes(x=Year)) + geom_histogram(binwidth = 5, color = 'black', fill = 'thistle') +
theme_minimal()
## Warning: Removed 928 rows containing non-finite outside the scale range
## (`stat_bin()`).
Year of release was heavily skewed left, with modern perfumes much more highly represented.
# Distribution of top 20 Brands
fragrantica_filter %>%
group_by(Brand) %>%
summarise(count = n()) %>%
top_n(n = 20, wt = count) %>%
ggplot(aes(x = reorder(Brand, count), y = count)) +
geom_bar(stat = 'identity', fill = 'thistle') +
coord_flip() +
labs(
x = 'Brand',
y = 'Count',
title = 'Frequency of Brand'
) +
theme_minimal()
These were the top 20 perfume brands represented in the dataset.
# Distribution of top 20 Countries
fragrantica_filter %>%
group_by(Country) %>%
summarise(count = n()) %>%
top_n(n = 20, wt = count) %>%
ggplot(aes(x = reorder(Country, count), y = count)) +
geom_bar(stat = 'identity', fill = 'thistle') +
coord_flip() +
labs(
x = 'Country',
y = 'Count',
title = 'Frequency of Country'
) +
theme_minimal()
France dominated the perfume market, followed by the USA.
# boxplot of gender and rating.values
fragrantica_filter %>% ggplot(aes(x=Gender,y=Rating.Value)) + geom_boxplot(fill="thistle") + theme_minimal() + labs(title="Relationship between gender and rating value") + theme(legend.position="none")
According to this box plot, rating values are distributed approximately equally across men's, unisex, and women's perfumes.
Based on our exploration, we found that many of the variables showed no clear relationship with gender. We elected to focus on the relationship between the main accords and gender.
To further explore the relationship of interest (main accords and gender), we applied unsupervised learning methods. Because the main accords (1-5) were factor variables with 20 levels each (after our filtering), naive dummy coding would produce 100 dummy variables.
To prepare for this, we cleaned our data one last time to match our methods. We chose a base level for each accord variable (amber), leaving 19 dummy variables per accord position (95 in total). All other variables were removed besides the main accords, Perfume, and Gender. Since Gender was our response variable, it was excluded from many of the unsupervised methods.
frag_model_data <- fragrantica_filter %>% select(-c(url, Brand, Country, Rating.Value, Rating.Count, Year, Perfumer1, Perfumer2)) %>% mutate(Gender=as.factor(Gender)) %>% clean_names()
frag_model_data[,3:97] <- lapply(frag_model_data[,3:97], as.numeric)
First we tried to reduce the dimensionality. Since the main accords likely encode overlapping information about each perfume, they may be redundant; PCA also eases computation.
pr_out <- frag_model_data %>% select(-c(perfume, gender)) %>% prcomp(scale = TRUE, center = TRUE)
pr_var = pr_out$sdev^2
pve=pr_var/sum(pr_var)
par(mfrow=c(1,2))
plot(pve, xlab="Principal Component",
ylab="Proportion of Variance Explained ", ylim=c(0,1),type='b')
plot(cumsum(pve), xlab="Principal Component ",
ylab=" Cumulative Proportion of Variance Explained ", ylim=c(0,1), type='b')
sum(cumsum(pve) < 0.80)
## [1] 63
63 Principal Components are needed to explain 80% of the variability in the data.
rainbow_colors <- rainbow(3)
plot_colors <- rainbow_colors[as.factor(frag_model_data$gender)]
plot(pr_out$x[,1], pr_out$x[,2], col=plot_colors, xlim = c(-6,6),
ylim = c(-6,6), xlab = "PC1", ylab = "PC2")
text(pr_out$x[,1], pr_out$x[,2], labels=frag_model_data$gender,
col = plot_colors, offset=5, cex=0.5)
According to this plot, perfumes of different genders seem slightly separated when plotted against Principal Components 1 and 2. Women-designated perfumes are more concentrated on the right of the plot (in blue), while men-designated perfumes are more concentrated on the left (in red). Unisex perfumes (in green) fall between the other two groups. This is encouraging, as there may be some relationship between a combination of main accords and gender.
Next, using the reduced dimensions, we tried to use K-means clustering to see if the fragrances fell into any easily recognizable groups.
set.seed(4)
frag_model_data_2 <- subset(frag_model_data, select = -c(perfume, gender)) # remove perfume name (not numeric)
pca_data <- as.data.frame(pr_out$x) # Extract PCA components
kmeans_result <- kmeans(pca_data[, 1:63], 3, iter.max = 10, nstart = 50)
fviz_cluster(kmeans_result,
data = frag_model_data_2,
ellipse.type = "convex", # Draw convex hull around clusters
geom = "point", # Show points instead of text
repel = TRUE, # Avoid text overlap
ggtheme = theme_minimal()
) # plot
pca_data$cluster <- as.factor(kmeans_result$cluster)
fig <- plot_ly(pca_data,
x = ~PC1,
y = ~PC2,
z = ~PC3,
color = ~cluster,
colors = c("palevioletred1", "lightgreen", "skyblue"),
type = "scatter3d",
mode = "markers") %>%
layout(scene = list(xaxis = list(title = "PC1"),
yaxis = list(title = "PC2"),
zaxis = list(title = "PC3")))
fig # Show the interactive 3D plot
kable(t(kmeans_result$size), caption = "Cluster Group Sizes")
| Cluster 1 | Cluster 2 | Cluster 3 |
|---|---|---|
| 4609 | 1582 | 4828 |
Using the first 63 principal components, as found before, and clustering into k = 3 groups, we get the above clustering. While it does not look well separated in two dimensions (PC1 x PC2), in three dimensions the clusters seem relatively well separated. However, the three clusters may not correspond to the Gender levels, and may instead group together fragrances with other shared attributes.
Next, we tried hierarchical clustering in a similar way.
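One way to check whether the k-means clusters track Gender is to cross-tabulate cluster assignments against the gender labels; in our case that would be `table(kmeans_result$cluster, frag_model_data$gender)`. A self-contained sketch of the idea, using mock labels in place of the real vectors:

```r
set.seed(1)
# Mock stand-ins for kmeans_result$cluster and frag_model_data$gender
cluster <- sample(1:3, 100, replace = TRUE)
gender  <- sample(c("men", "unisex", "women"), 100, replace = TRUE)

# Rows = clusters, columns = gender; a cluster mirroring one gender
# category would concentrate its counts in a single column
tab <- table(Cluster = cluster, Gender = gender)
print(tab)

# "Purity": share of each cluster falling in its most common category
purity <- apply(prop.table(tab, margin = 1), 1, max)
round(purity, 2)
```

Purities near 1 would indicate clusters aligned with gender; values near 1/3 (as with these random labels) indicate no alignment.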
fragantica_dist <- dist(pca_data[,1:63])
set.seed(1)
frag_hclust = hclust(fragantica_dist, method = "complete")
dend1 = as.dendrogram(frag_hclust)
dend1 = color_branches(dend1, k=3)
dend1 = color_labels(dend1, k=3)
dend1 = set_labels(dend1, labels=frag_model_data$gender[order.dendrogram(dend1)])
dend1 = set(dend1, "labels_cex", 0.3)
plot(dend1, horiz=T, main = "Dendrogram colored by three clusters")
cut_avg <- cutree(frag_hclust, k = 3)
frag_cl <- mutate(frag_model_data, cluster = cut_avg)
kable(count(frag_cl,cluster))
| cluster | n |
|---|---|
| 1 | 10969 |
| 2 | 37 |
| 3 | 13 |
With the 63 PCs found earlier, and cutting the tree to create k = 3 clusters, we get the above dendrogram. Two extremely small clusters, with 37 and 13 data points, split away early on; the remaining 10,969 data points fall into the last cluster. This suggests the clusters may not represent the Gender categories and may instead be grouping on some other shared features.
After exploring the data further with unsupervised learning models, we next tried to predict the gender designation of a perfume using supervised models. Since our data consisted of categorical variables and dummy-coded predictors, we focused on trees and related models, as they handle non-numeric data well.
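The highly unbalanced split is partly a property of the linkage choice: complete linkage can peel off small outlier groups, while Ward linkage tends toward balanced clusters. A small self-contained sketch on simulated data (not the perfume data) showing how cluster sizes depend on the linkage method:

```r
set.seed(2)
# Simulated feature matrix standing in for the PCA scores
x <- matrix(rnorm(60 * 4), ncol = 4)
d <- dist(x)

# Cut each tree into k = 3 clusters and compare group sizes
for (m in c("complete", "average", "ward.D2")) {
  hc <- hclust(d, method = m)
  sizes <- as.vector(table(cutree(hc, k = 3)))
  cat(sprintf("%-9s %s\n", m, paste(sizes, collapse = " / ")))
}
```

Re-running the clustering above with `method = "ward.D2"` would be one way to probe whether the tiny clusters are a linkage artifact.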
#split data into training and test
set.seed(2) # for a reproducible train/test split
train = sample(nrow(frag_model_data), 0.75*nrow(frag_model_data))
frag_train = frag_model_data[train, ]
frag_test = frag_model_data[-train,]
tree_frag = tree(gender~.-perfume, data=frag_train)
set.seed(3)
cv = cv.tree(tree_frag, FUN=prune.misclass, K=10)
(best.cv = min(cv$size[cv$dev == min(cv$dev)]))
## [1] 5
pruned_tree <- prune.misclass(tree_frag, best = best.cv)
plot(pruned_tree)
text(pruned_tree, pretty=0, col = "blue", cex = .5)
tree_pred = predict(tree_frag, frag_test, type="class")
First we split the data into 75% training and 25% test data. Then we used k-fold cross-validation to find the best tree size, which turned out to be 5. Each split in the tree uses an aromatic main accord: if the perfume has an aromatic main accord 1, 2, or 3, it is most likely a men's perfume. Once the tree reaches aromatic main accord 4, a perfume with that accord is most likely unisex, and women's otherwise.
tree_err = table(Prediction = tree_pred, Actual = frag_test$gender)
kable(tree_err, caption="Confusion matrix for single decision tree")
|  | men | unisex | women |
|---|---|---|---|
| men | 332 | 197 | 109 |
| unisex | 50 | 73 | 57 |
| women | 179 | 439 | 1319 |
# Test accuracy rate
sum(diag(tree_err))/sum(tree_err)
## [1] 0.6257713
# Test error rate (Classification Error)
1-sum(diag(tree_err))/sum(tree_err)
## [1] 0.3742287
This single decision tree has a test error rate of 37.42% and an accuracy of 62.58%.
bag_frag = randomForest(gender ~ .-perfume,
data=frag_train,
mtry=95, importance=TRUE)
yhat_bag = predict(bag_frag, newdata = frag_test)
test_bag_err = mean(yhat_bag != frag_test$gender)
test_bag_err
## [1] 0.384755
table(Prediction = yhat_bag, Actual = frag_test$gender)
## Actual
## Prediction men unisex women
## men 295 181 94
## unisex 153 208 199
## women 113 320 1192
Next we used ensembles of decision trees, beginning with bagging, which is a special case of the random forest where m = p; that is, all predictors are considered at each split of each tree. This bagged model has a test error rate of 38.48%, which is worse than the single decision tree. From the confusion matrix, it predicts better for unisex perfumes but worse for men's and women's perfumes compared to the previous model.
rf_frag = randomForest(gender ~ .-perfume, data=frag_train, importance=TRUE)
rf_frag
##
## Call:
## randomForest(formula = gender ~ . - perfume, data = frag_train, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 9
##
## OOB estimate of error rate: 35.07%
## Confusion matrix:
## men unisex women class.error
## men 1144 350 261 0.3481481
## unisex 637 542 1033 0.7549729
## women 235 382 3680 0.1435886
plot(rf_frag)
yhat_rf = predict(rf_frag, newdata =frag_test)
test_rf_err = mean(yhat_rf != frag_test$gender)
test_rf_err
## [1] 0.3357532
table(Prediction = yhat_rf, Actual = frag_test$gender)
## Actual
## Prediction men unisex women
## men 365 203 88
## unisex 103 171 103
## women 93 335 1294
The random forest considered 9 variables at each split, and 500 trees were used to fit the data. The out-of-bag estimate of the error rate is 35.07%, and the test error rate is 33.58%, a slight improvement over bagging.
head(importance(rf_frag))
## men unisex women MeanDecreaseAccuracy
## mainaccord1_aquatic 15.105001 0.01626316 -0.5442219 10.2727904
## mainaccord1_aromatic 53.503606 -15.88654198 38.2508460 55.0389295
## mainaccord1_citrus 3.307418 11.25272430 9.9124414 14.8864241
## mainaccord1_earthy -2.664173 0.63522811 0.2085678 -0.3741032
## mainaccord1_floral 35.300983 11.33389072 47.5289956 54.5525432
## mainaccord1_fresh 1.803723 1.54490798 2.8073521 3.7168451
## MeanDecreaseGini
## mainaccord1_aquatic 7.828905
## mainaccord1_aromatic 154.872662
## mainaccord1_citrus 45.113326
## mainaccord1_earthy 1.889110
## mainaccord1_floral 101.275916
## mainaccord1_fresh 3.986819
varImpPlot(rf_frag, sort=T,
main="Variable Importance for Random Forests", n.var=5)
We also examined the most important accords. By MeanDecreaseAccuracy, the most important are mainaccord1_aromatic, mainaccord1_floral, mainaccord2_floral, mainaccord2_aromatic, and mainaccord1_white_floral. By MeanDecreaseGini, they are mainaccord1_aromatic, mainaccord2_aromatic, mainaccord1_floral, mainaccord3_aromatic, and mainaccord2_fresh_spicy.
set.seed(1)
boost_frag = gbm(gender~.-perfume, data=frag_train,
distribution="multinomial", n.trees=500, interaction.depth=2)
head(summary(boost_frag))
## var rel.inf
## mainaccord1_aromatic mainaccord1_aromatic 8.343913
## mainaccord2_aromatic mainaccord2_aromatic 7.870312
## mainaccord3_aromatic mainaccord3_aromatic 4.500041
## mainaccord1_floral mainaccord1_floral 4.192445
## mainaccord1_fruity mainaccord1_fruity 3.005525
## mainaccord2_fresh_spicy mainaccord2_fresh_spicy 2.987877
yhat_boost = predict(boost_frag, newdata = frag_test,
n.trees=500, type = "response")
yhat_boost = yhat_boost[,,1]
yhat_boost_resp = colnames(yhat_boost)[apply(yhat_boost, 1, which.max)]
test_boost_err = mean(yhat_boost_resp != frag_test$gender)
test_boost_err
## [1] 0.3292196
table(Predictions = yhat_boost_resp, Actual = frag_test$gender)
## Actual
## Predictions men unisex women
## men 333 168 64
## unisex 144 222 128
## women 84 319 1293
Finally, we fit a boosting model. According to the relative influence summary, the top 5 most important accords are aromatic in mainaccord positions 1, 2, and 3, followed by mainaccord1_floral and mainaccord1_fruity. This model has a test error rate of 32.92%, a further improvement over the random forest. Looking at the confusion matrix, this model predicts relatively well for men's and women's perfumes, and less well for unisex perfumes.
actual <- frag_test$gender
classes <- colnames(yhat_boost)
auc_values <- c()
colors <- rainbow(length(classes))
for (i in 1:length(classes)) {
actual_binary <- as.numeric(actual == classes[i])
pred <- prediction(yhat_boost[, i], actual_binary)
perf <- performance(pred, "tpr", "fpr")
plot(perf, col = colors[i], add = i != 1, main = "ROC Curves for Each Class from Boosting Model")
auc <- performance(pred, "auc")@y.values[[1]]
auc_values[i] <- auc
}
abline(0,1, lty=2)
legend("bottomright", legend = paste(classes, "AUC:", round(auc_values, 3)),
col = colors, lwd = 2, cex = 0.8)
names(auc_values) <- classes
kable(auc_values, caption="AUC values")
| Class | AUC |
|---|---|
| men | 0.8785287 |
| unisex | 0.6990985 |
| women | 0.8613860 |
Since there are three levels of response, we used a one-vs-all approach to plot the ROC curves and calculate AUC. The AUC for perfumes targeted towards men is 0.8785, for women it is 0.8614, and for unisex perfumes it is 0.6991.
After exploring various models, both unsupervised and supervised, we found some relationship between a perfume's main accords (1-5) and the gender it is targeted toward. The best model for predicting the gender audience of a perfume was the boosting model, with the lowest test error rate of 32.92%. Boosting improves on single decision trees by fitting trees sequentially, each correcting the errors of the last, which reduces bias. The AUC of this model is 0.8785 (men), 0.8614 (women), and 0.6991 (unisex).
Since our predictor variables were all dummy variables derived from the main accords, PCA was less effective. The first principal component, which explains the most variability of any PC, accounted for only 2.5% of the variability in the data. Thus, many PCs were necessary in order to explain most of the variability.
According to our single decision tree and the most influential variables in our random forest and boosting models, the biggest factor in whether a scent was for women or for men was whether its first two main accords were aromatic, with more aromatic fragrances targeted toward men. The best model does fairly well on fragrances designated for women and for men, but struggles with unisex scents. This is what we would expect, as unisex scents likely share characteristics with both men's and women's perfumes. Thus the model tends to categorize them as either for women or for men, instead of recognizing them as a separate category.
This dataset was challenging to work with in a statistical machine learning context, due to its many categorical variables, each with many levels. This is not ideal for many parametric methods, so this project focused on non-parametric methods for prediction. Despite these challenges, we were able to learn about the characteristics that distinguish perfumes as for women or for men. This project could be improved by handling factor levels in a better way than dummy coding every level, or by incorporating numerical data into the predictors. Further improvements could address unisex fragrances and better separate them from women's and men's perfumes.
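The difficulty with unisex scents can be quantified as per-class recall (the share of each actual class predicted correctly), computed directly from the boosting confusion matrix reported above:

```r
# Boosting confusion matrix from above (rows = predicted, columns = actual)
conf <- matrix(c(333, 144,  84,    # actual = men
                 168, 222, 319,    # actual = unisex
                  64, 128, 1293),  # actual = women
               nrow = 3,
               dimnames = list(Prediction = c("men", "unisex", "women"),
                               Actual     = c("men", "unisex", "women")))

# Per-class recall: correct predictions divided by column totals
recall <- diag(conf) / colSums(conf)
round(recall, 3)
# → men 0.594, unisex 0.313, women 0.871
```

Under a third of unisex perfumes are recovered, versus about 87% of women's, which matches the AUC values reported earlier.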